[% setvar title docs/pdds/pdd04_datatypes.pod - Parrot's internal data types %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='NAME'></a><h1>NAME</h1>
<p>docs/pdds/pdd04_datatypes.pod - Parrot's internal data types</p>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>This PDD describes Parrot's internal data types.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>This PDD details the basic datatypes that the Parrot core knows how to deal
with. Three of these (the integer, floating point and string datatypes)
have no additional semantics. The fourth datatype, the Parrot Magic Cookie
(PMC) acts as the basis for all high level languages running on top of
Parrot; only the most basic aspects are described here.</p>
<p>Note that PMC and string internals are volatile and may be changed in the
future (although this will become increasingly unlikely as we near v1.0).
Access from external code to the internals of particular datatypes should
be via the extension mechanism (see <b><i>docs/pdds/pdd11_extending.pod</i></b>,
which has more explicit guarantees of stability.</p>
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<a name='Integer data types'></a><h2>Integer data types</h2>
<p>Integer data types are generically referred to as <code>INT</code>s. <code>INT</code>s are
conceptual things and there is no data structure that corresponds to them.</p>
<ul>
<li><a name='Platform-native integer'></a>Platform-native integer</li>
<p>These are whatever size native integer was chosen at Parrot configuration
time. The C-level typedefs <code>INTVAL</code> and <code>UINTVAL</code> get you a
platform-native signed and unsigned integer respectively.</p>
<li><a name='Arbitrary precision integers'></a>Arbitrary precision integers</li>
<p>Big integers, or bigints, are arbitrary-length integer numbers. The
only limit to the number of digits in a bigint is the lesser of the
amount of memory available or the maximum value that can be
represented by a <code>UINTVAL</code>. This will generally allow at least 4 billion
digits, which ought to be far more than enough for anyone.</p>
<p>The internal representation of a bigint has not yet been determined.
One possible implementation is described below; for an alternative
implementation, see <b><i>docs/pdds/pdd14_bignum.pod</i></b>.</p>
<li><a name='A possible bigint implementation'></a>A possible bigint implementation</li>
<p>We can represent a bigint with:</p>
<pre>    struct bigint {
         void *buffer;
         UINTVAL length;
         INTVAL exponent;
         UINTVAL flags;
    }</pre>
<p>Should we scrap the buffer pointer and just tack the buffer on the end
of the structure? Saves a level of indirection, but means if we need
to make the buffer bigger we have to adjust anything pointing to it.</p>
<p>The <code>buffer</code> pointer points to the buffer holding the actual
number, <code>length</code> is the length of the buffer, <code>exponent</code> is the base
10 exponent for the number (so 2e4532 doesn't take up much space), and
<code>flags</code> are some flags for the bigint.</p>
</ul>
<a name='Floating point data types'></a><h2>Floating point data types</h2>
<p>Floating point data types are generically referred to as <code>NUM</code>s. Like
<code>INT</code>s, <code>NUM</code>s are conceptual things, not real data structures.</p>
<ul>
<li><a name='Platform native float'></a>Platform native float</li>
<p>These are whatever size float was chosen when parrot was configured. The
C level typedef <code>FLOATVAL</code> will get you one of these.</p>
<li><a name='Arbitrary precision decimal numbers'></a>Arbitrary precision decimal numbers</li>
<p>Arbitrary precision decimal numbers, or bignums, can have any number
of digits before and after the decimal point.</p>
<p>The internal representation of a bigint has not yet been determined.
One possible implementation is described below; for an alternative
implementation, see <b><i>docs/pdds/pdd14_bignum.pod</i></b>.</p>
<li><a name='A possible bignum implementation'></a>A possible bignum implementation</li>
<p>Bignums could be represented by the structure:</p>
<pre>    struct bignum {
        void *buffer;
        UINTVAL length;
        INTVAL exponent;
        UINTVAL flags;
    }</pre>
<p>The similarity to the suggested bigint structure is not accidental; we
want upgrading from bigint to bignum to be quick.</p>
<p>Like the bigint structure, should we toss the data pointer and just
tack the data on the end?</p>
</ul>
<a name='String data types'></a><h2>String data types</h2>
<p>Parrot has a single internal string form:</p>
<pre>    struct parrot_string_t {
        pobj_t obj;
        UINTVAL bufused;
        void *strstart;
        UINTVAL strlen;
        const ENCODING *encoding;
        const CHARTYPE *type;
        INTVAL language;
    }</pre>
<p>The fields are:</p>
<ul>
<li><a name='obj'></a>obj</li>
<p>A pointer to a Parrot object, Parrot's most general internal data type.
In this case, it holds the buffer for the string data, the size of the
buffer in bytes, and any applicable flags.</p>
<li><a name='bufused'></a>bufused</li>
<p>The amount of the buffer currently in use, in bytes.</p>
<li><a name='strstart'></a>strstart</li>
<p>A pointer to the beginning of the actual string (which may not be positioned
at the start of the buffer).</p>
<li><a name='strlen'></a>strlen</li>
<p>The length of the string, in characters.</p>
<li><a name='encoding'></a>encoding</li>
<p>How the data is encoded (e.g. fixed 8-bit characters, UTF-8, or UTF-32).
Note that this specifies encoding only -- it's valid to encode
EBCDIC characters with the UTF-8 algorithm. Silly, but valid.</p>
<p>The ENCODING structure specifies the encoding (by index number and by name,
for ease of lookup), the maximum number of bytes that a single character
will occupy in that encoding, as well as functions for manipulating strings
with that encoding.</p>
<li><a name='type'></a>type</li>
<p>What sort of string data is in the buffer, for example ASCII, EBCDIC,
or Unicode.</p>
<p>The CHARTYPE structure specifies the character type (by index number and by
name) and provides functions for transcoding to and from that character type.</p>
<li><a name='language'></a>language</li>
<p>This specifies the language corresponding to the string. This is to
allow for locale-based data to be attached to strings. To give an
example of the use of this: strings in German may not sort in the same
way as strings in French, even when both types use the Latin-1 charset
and are encoded in UTF-8.</p>
<p>Note that language-agnostic utilities are at liberty to ignore this entry.</p>
</ul>
<a name='Parrot Magic Cookies (PMCs)'></a><h2>Parrot Magic Cookies (PMCs)</h2>
<p>Parrot Magic Cookies, or PMCs, are the last of Parrot's basic datatypes,
but are also potentially the most important. Their basic structure is
as follows. All PMCs have the form:</p>
<pre>    struct PMC {
        pobj_t obj;
        VTABLE *vtable;
 #if ! PMC_DATA_IN_EXT
        DPOINTER *data;
 #endif
        struct PMC_EXT *pmc_ext;
    };</pre>
<p>where <code>obj</code> is a pointer to an <code>pobj_t</code> structure:</p>
<pre>    typedef struct pobj_t {
        UnionVal u;
        Parrot_UInt flags;
 #if ! DISABLE_GC_DEBUG
        UINTVAL _pobj_version;
 #endif
    } pobj_t;</pre>
<p>and where:</p>
<pre>    typedef union UnionVal {
        struct {
            void * _bufstart;
            size_t _buflen;
        } _b;
        struct {
            DPOINTER* _struct_val;
            PMC* _pmc_val;
        } _ptrs;
        INTVAL _int_val;
        FLOATVAL _num_val;
        struct parrot_string_t * _string_val;
    } UnionVal;</pre>
<p><code>u</code> holds data associated with the PMC. This can be in the form of an
integer value, a floating point value, a string value, or a pointer
to other data. <code>u</code> may be empty, since the PMC structure also provides
a more general data pointer, but is useful for PMCs which hold only a
single piece of data (e.g. <code>PerlInts</code>).</p>
<p><code>flags</code> holds a set of flags associated with the PMC; these are documented
in <b><i>include/parrot/pobj.h</i></b>, and are generally only used within the Parrot
internals.</p>
<p><code>_pobj_version</code> is only used for debugging Parrot's garbage collector.
It is documented elsewhere (well, it will be once we get around to
doing that...).</p>
<p><code>vtable</code> holds a pointer to the <b>vtable</b> associated with the PMC.
This points to a set of functions, with interfaces described in
<b><i>docs/pdds/pdd02_vtables.pod</i></b> that implement the basic behaviour of
the PMC (i.e. how it behaves under addition, subtraction, cloning etc.)</p>
<p><code>data</code> (if present) holds a pointer to any additional data associated
with the PMC. This may be NULL.</p>
<p><code>pmc_ext</code> points to an extended PMC structure. This has the form:</p>
<pre>    struct PMC_EXT {
 #if PMC_DATA_IN_EXT
        DPOINTER *data;
 #endif
        PMC *_metadata;
        struct _Sync *_synchronize;
        PMC *_next_for_GC;
    };</pre>
<p><code>data</code> is a generic data pointer, as described above.</p>
<p><code>_metadata</code> holds internal PMC metadata. The specification for this has not
yet been finalized.</p>
<p>XXX: what does <code>_synchronize</code> do?</p>
<p>XXX:  ditto <code>_next_for_GC</code>...</p>
<p>PMCs are not required to have a <code>PMC_EXT</code> structure (i.e. <code>pmc_ext</code> can
be null).</p>
<p>PMCs are used to implement the basic data types of the high level languages
running on top of Parrot. For instance, a Perl 5 <code>SV</code> will map onto one
(or more) types of PMC, while particular Python datatypes will map onto
different types of PMC.</p>
<a name='ATTACHMENTS'></a><h1>ATTACHMENTS</h1>
<p>None.</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>The perl modules Math::BigInt and Math::BigFloat. Alex Gough's suggestions
for bigint/bignum implementation, outlined in <b><i>docs/pdds/pdd14_bignum.pod</i></b>.
The Unicode standard at <a href='http://www.unicode.org.' target='_blank'>www.unicode.org.</a></p>
<a name='GLOSSARY'></a><h1>GLOSSARY</h1>
<ul>
<li><a name='Type'></a>Type</li>
<p>Type refers to a basic Parrot data type. There are four such: integers,
floating point numbers (often just numbers), strings and Parrot Magic
Cookies (PMCs).</p>
</ul>
<a name='VERSION'></a><h1>VERSION</h1>
<p>1.4</p>
<a name='CURRENT'></a><h2>CURRENT</h2>
<pre>     Maintainer: Dan Sugalski &lt;<a href='mailto:dan@sidhe.org'>dan@sidhe.org</a>&gt;
     Class: Internals
     PDD Number: 4
     Version: 1.4
     Status: Developing
     Last Modified: 20 February 2004
     PDD Format: 1
     Language: English</pre>
<a name='HISTORY'></a><h2>HISTORY</h2>
<ul>
<li><a name='Version 1.4, 20 February 2004'></a>Version 1.4, 20 February 2004</li>
<li><a name='Version 1.3, 2 July 2001'></a>Version 1.3, 2 July 2001</li>
<li><a name='Version 1.2, 2 July 2001'></a>Version 1.2, 2 July 2001</li>
<li><a name='Version 1.1, 2 March 2001'></a>Version 1.1, 2 March 2001</li>
<li><a name='Version 1, 1 March 2001'></a>Version 1, 1 March 2001</li>
</ul>
<a name='CHANGES'></a><h1>CHANGES</h1>
<ul>
<li><a name='Version 1.4'></a>Version 1.4</li>
<p>Document basic PMC internals. Make clear the fact that the bigint/bignum
description is still provisional. Other minor fixups to make the
documentation match reality.</p>
<li><a name='Version 1.3'></a>Version 1.3</li>
<p>Fixed some silly typos and dropped phrases.</p>
<p>Took all the underscores out of the field names.</p>
<li><a name='Version 1.2'></a>Version 1.2</li>
<p>The string header format has changed some to allow for type
tagging. The flags information for strings has changed as well.</p>
<li><a name='Version 1.1'></a>Version 1.1</li>
<p>INT and NUM are now concepts rather than data structures, as making
them data structures was a Bad Idea.</p>
<li><a name='Version 1'></a>Version 1</li>
<p>None. First version</p>
</ul>
</div>
